Collection of Internet


Internet Message Format | 1994-01-24 | 3KB

Received: by nxoc01.cern.ch (NeXT-1.0 (From Sendmail 5.52)/NeXT-2.0)
    id AA13108; Wed, 30 Jun 93 22:35:01 MET DST
Return-Path: <mkgray@athena.mit.edu>
Received: from dxmint.cern.ch by nxoc01.cern.ch (NeXT-1.0 (From Sendmail 5.52)/NeXT-2.0)
    id AA13104; Wed, 30 Jun 93 22:34:59 MET DST
Received: from ATHENA-AS-WELL.MIT.EDU by dxmint.cern.ch (5.65/DEC-Ultrix/4.3)
    id AA11512; Wed, 30 Jun 1993 22:58:22 +0200
Received: from URANUS.MIT.EDU by Athena.MIT.EDU with SMTP
    id AA17474; Wed, 30 Jun 93 16:58:20 EDT
From: mkgray@athena.mit.edu
Received: by uranus.MIT.EDU (AIX 3.2/UCB 5.64/4.7)
    id AA23487; Wed, 30 Jun 1993 16:58:18 -0400
Message-Id: <9306302058.AA23487@uranus.MIT.EDU>
To: sanders@bsdi.com
Cc: www-talk@nxoc01.cern.ch
Subject: Re: searchable index of the web
In-Reply-To: Your message of Wed, 30 Jun 93 15:30:47 -0500.
    <9306302030.AA09977@austin.BSDI.COM>
Date: Wed, 30 Jun 93 16:58:15 EDT

OK, how "big" is the Web? Here is what W4 has found out.

Actually, first I'd better explain a little bit about what the wanderer
does. It does a simple depth-first search, with an added feature I call
'getting bored'. That is, if it finds a number of documents that have the
same URL up to the last field (e.g. http://foo/bar/blah,
http://foo/bar/baz, http://foo/bar/more), it will eventually get 'bored'
and skip the rest. This makes it go a little quicker. Of course, it is
potentially losing some documents here, but probably not many.

W4 took many hours (maybe 20) to run, but I don't remember exactly,
because it saves state so I could kill it and restart it whenever I wanted.

Well, in total, W4 found more than 17,000 http documents (it didn't
follow any other kinds of links) and more than 125 unique hosts.

In the current version, it *only* retrieved the URL of the document. In
the next version, I hope to have it do the following other things.
    o Get the <title>Title</title> of the document
    o Get the length of the document
    o Do a 'keyword' analysis of the document
    o Count the number of links in a document
    o Improve on the boredom system

By a 'keyword' analysis, I mean looking at the document for words that
appear frequently, but aren't normally common words. Additionally, titles
and things appearing in headers would be good candidates for keyword
searches.

I'll try to get the current code at least clean enough that I'm willing
to let everyone in the world see it, but if you *really* want to see it
now, send me mail. Any other suggestions would be welcome.

Once this index is produced, it will be searchable via http, and I suppose
by WAIS, though I really detest the way WAIS restricts searches. In any
case, there is a possibility that this will be done by the end of the
summer.

Matthew Gray
mkgray@athena.mit.edu
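
The traversal described in the message (a depth-first search that 'gets bored' once it has seen enough documents sharing a URL up to the last field) might be sketched roughly as below. This is an illustration under stated assumptions, not the W4 code, which the message does not show. The names `url_prefix`, `wander`, `BOREDOM_LIMIT`, and `EXAMPLE_LINKS` are hypothetical, the threshold of 3 is made up (the message gives no number), and an in-memory link table stands in for fetching real documents over http.

```python
from urllib.parse import urlsplit

# Hypothetical threshold: the message does not say how many sibling
# documents it takes before the wanderer gets "bored".
BOREDOM_LIMIT = 3


def url_prefix(url):
    """Return the URL up to, but not including, its last path field."""
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0]
    return f"{parts.scheme}://{parts.netloc}{directory}/"


def wander(start, links, boredom_limit=BOREDOM_LIMIT):
    """Depth-first walk over `links`, skipping prefixes visited too often.

    `links` maps a URL to the list of URLs it links to. It is an assumed
    stand-in for retrieving a document and extracting its links.
    """
    visited = []      # documents recorded, in visit order
    seen = set()      # URLs already popped from the stack
    per_prefix = {}   # prefix -> number of documents recorded under it
    stack = [start]
    while stack:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        prefix = url_prefix(url)
        if per_prefix.get(prefix, 0) >= boredom_limit:
            continue  # bored: skip further documents under this prefix
        per_prefix[prefix] = per_prefix.get(prefix, 0) + 1
        visited.append(url)
        # Push in reverse so the first link in a document is explored first.
        for target in reversed(links.get(url, [])):
            if target not in seen:
                stack.append(target)
    return visited


# A made-up link table, reusing the http://foo/bar/... shape from the message.
EXAMPLE_LINKS = {
    "http://foo/index.html": [
        "http://foo/bar/blah",
        "http://foo/bar/baz",
        "http://foo/bar/more",
        "http://foo/bar/extra",
        "http://foo/other/x",
    ],
    "http://foo/bar/blah": ["http://foo/index.html"],  # a cycle back to the start
}

if __name__ == "__main__":
    for doc in wander("http://foo/index.html", EXAMPLE_LINKS):
        print(doc)
```

With the limit of 3 assumed here, the walk records the first three documents under http://foo/bar/, skips http://foo/bar/extra, and still reaches http://foo/other/x. As the message notes, a heuristic like this trades completeness for speed: anything reachable only through a skipped document is never seen.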